We start by importing the AnnData object we downloaded from the Human Developmental Cell Atlas portal; it contains a variety of celltypes beyond just the myeloid lineage which we’re interested.
We’ll begin our analysis by running our cells through a standard preprocessing pipeline composed of QC, integration, normalization, HVG selection, dimension reduction, & clustering.
4.2.1 QC
We filter out cells with a sequencing depth of less than 1,000, then remove out genes expressed in less than 5 cells.
Using the scVI library, we perform celltype-aware integration across batches by training a variational autoencoder (VAE) for 150 epochs. This provides us with a 20-dimensional integrated latent space embedding that is (hopefully) free of batch effects.
We use the PAGA algorithm to generate a graph abstraction of the relationships between celltypes.
Code
sc.tl.paga(ad_myeloid, groups='celltype')
Visualizing the results shows a pretty linear structure that matches what we expect biologically - stem cells at the root, monocytes & macrophages in the middle, and Kupffer cells at the end.
Overall the diffusion pseudotime captures myeloid differentiation progression well, but it assigns very similar values to all the mature Kupffer cells which isn’t ideal.
We can now create a Seurat object, to which we add our various embeddings and metadata. After adding all the various bits and pieces, we normalize the raw counts and re-identify HVGs.
The runtime for tradeSeq with 5 knots per gene is:
Code
ts_diff
Time difference of 2.234959 hours
5.5 Cell fate analysis
Lastly, we’ll perform an analysis of cell fate specification using the CellRank Python package.
5.5.1 CytoTRACE kernel
We begin by computing a CytoTRACE kernel; CytoTRACE uses the number of genes expressed in a cell as a proxy measurement for differentiation potential, with the assumption being that cells expressing more genes are more likely to be progenitor celltypes & vice versa.
We plot the CytoTRACE scores below. Interestingly, in addition to the expected result of HSCs having high CytoTRACE scores we see a subcluster of Kupffer cells with high scores as well.
Plotting the projection of that matrix on our force-directed graph embedding shows messy / reversed directionality. This indicates that CytoTRACE might not work very well on this dataset, and at the very least should be augmented with another data source.
/blue/rbacher/j.leary/py_envs/scLANE_env2/lib/python3.10/site-packages/scvelo/plotting/utils.py:63: DeprecationWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead
return isinstance(c, str) and c in data.obs.keys() and cat(data.obs[c])
/blue/rbacher/j.leary/py_envs/scLANE_env2/lib/python3.10/site-packages/scvelo/plotting/utils.py:63: DeprecationWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead
return isinstance(c, str) and c in data.obs.keys() and cat(data.obs[c])
Plotting the projection of the pseudotime kernel’s transition probability matrix shows a much more reasonable portrayal of differentiation directionality.
Figure 12: Force-directed graph embedding overlaid with streamline arrows derived from Slingshot pseudotime.
5.5.3 Combined kernel
We combine the two kernel using unequal weighting, with the pseudotime kernel being given priority due to its better overall capture of differentiation.
Code
ck =0.2* ctk +0.8* pk
We visualize the combined kernel’s projection below. It seems to reliably portray the direction of differentiation in our dataset.
Figure 13: Force-directed graph embedding overlaid with streamline arrows derived from the combined kernel.
We can also simulate random walks on the high-dimensional cell-cell graph starting from the HSC population. We see that nearly all of the random walks terminate in the Kupffer cell population.
Figure 14: Force-directed graph embedding overlaid with random walks along the cell-cell graph derived from the combined kernel. Starting points are highlighted in black and ending points in yellow.
6 Save data
Finally, we save our Seurat object, the output from Slingshot, and the models from scLANE and tradeSeq.
---title: "Processing Pipeline for Hematopoesis Data from Popescu *et al* (2019)"subtitle: "Bacher Group"date: todaydate-format: longauthor: - name: Jack R. Leary orcid: 0009-0004-8821-3269 email: j.leary@ufl.edu corresponding: true affiliation: - name: University of Florida department: Department of Biostatistics city: Gainesville state: FL country: USformat: html: theme: journal highlight-style: tango toc: true toc-depth: 2 toc-location: left code-copy: hover code-tools: true code-fold: show embed-resources: true number-sections: true fig-width: 6 fig-height: 4 fig-dpi: 320 fig-align: center fig-cap-location: bottom tbl-cap-location: bottomeditor: source---```{r setup}#| include: falseknitr::opts_chunk$set(comment = NA)reticulate::use_condaenv("../../../../py_envs/scLANE_env2/", required = TRUE)options(future.globals.maxSize = 1e20)set.seed(312) # lucky seed```# Libraries {#sec-libs}Before we do anything else we load in the necessary software packages in both R & Python.## R```{r}#| message: false#| warning: falselibrary(dplyr) # data manipulationlibrary(Seurat) # scRNAseq toolslibrary(scLANE) # trajectory DElibrary(ggplot2) # pretty plotslibrary(biomaRt) # gene annotationlibrary(tradeSeq) # more trajectory DElibrary(patchwork) # plot alignmentlibrary(slingshot) # pseudotime estimationlibrary(reticulate) # Python interfaceselect <- dplyr::selectrename <- dplyr::rename```## Python```{python}#| message: false#| warning: falseimport scvi # integrationimport warnings # filter out warningsimport numpy as np # linear algebra toolsimport scanpy as sc # scRNA-seq toolsimport pandas as pd # DataFramesimport anndata as ad # scRNA-seq data structuresimport scvelo as scv # RNA velocityimport cellrank as cr # cell fate estimationimport matplotlib.pyplot as plt # pretty plotsfrom scipy.io import mmread, mmwrite # sparse matrix IOfrom cellrank.estimators import GPCCA # CellRank estimatorfrom scipy.sparse import coo_matrix, csr_matrix # sparse matricesfrom cellrank.kernels import PseudotimeKernel, CytoTRACEKernel # CellRank kernels```Here we set a few global variables to reduce the amount of printed output caused by our Python code. ```{python}warnings.simplefilter('ignore', category=UserWarning)sc.settings.verbosity =0scv.settings.verbosity =0cr.settings.verbosity =0```# Visualization tools {#sec-viz-tools}To make our visualizations prettier we'll define some consistent color palettes, and set a theme for `matplotlib` that matches `ggplot2::theme_classic()`. ## Color palettes```{r}palette_heatmap <- paletteer::paletteer_d("MetBrewer::Cassatt1", direction =-1)palette_celltype <-as.character(paletteer::paletteer_d("ggsci::category10_d3"))[1:6]names(palette_celltype) <-c("HSC", "Kupffer cell", "Monocyte-macrophage", "Monocyte", "Monocyte precursor", "Neutrophil-myeloid progenitor")palette_cluster <- paletteer::paletteer_d("ggsci::default_igv")```## Theme for `matplotlib````{python}base_size =12plt.rcParams.update({# font'font.size': base_size, 'font.weight': 'normal',# figure'figure.dpi': 320, 'figure.edgecolor': 'white', 'figure.facecolor': 'white', 'figure.figsize': (6, 4), 'figure.constrained_layout.use': True,# axes'axes.edgecolor': 'black','axes.grid': False,'axes.labelpad': 2.75,'axes.labelsize': base_size *0.8,'axes.linewidth': 1.5,'axes.spines.right': False,'axes.spines.top': False,'axes.titlelocation': 'left','axes.titlepad': 11,'axes.titlesize': base_size,'axes.titleweight': 'normal','axes.xmargin': 0.1, 'axes.ymargin': 0.1, # legend'legend.borderaxespad': 1,'legend.borderpad': 0.5,'legend.columnspacing': 2,'legend.fontsize': base_size *0.8,'legend.frameon': False,'legend.handleheight': 1,'legend.handlelength': 1.2,'legend.labelspacing': 1,'legend.title_fontsize': base_size, 'legend.markerscale': 1.5})```# Helper functions {#sec-fns}We first define a utility function to make our plot legends cleaner to read & easier to make. ```{r}guide_umap <-function(key.size =4) { ggplot2::guides(color = ggplot2::guide_legend(override.aes =list(size = key.size,alpha =1, stroke =0.25)))}```# Data {#sec-data}## Import hematopoesis dataWe start by importing the `AnnData` object we downloaded from the Human Developmental Cell Atlas portal; it contains a variety of celltypes beyond just the myeloid lineage which we're interested. ```{python}ad_liver = ad.read_h5ad('../../data/hematopoesis/fetal_liver_alladata.h5ad')```We subset to just the myeloid lineage, and remove cells with a percentage mitochondrial reads greater than 15%.```{python}ad_myeloid = ad_liver[ad_liver.obs['cell.labels'].isin(['HSC_MPP', 'Neutrophil-myeloid progenitor', 'Monocyte precursor', 'Monocyte', 'Mono-Mac', 'Kupffer Cell'])]ad_myeloid.obs['celltype'] = ( ad_myeloid.obs['cell.labels'] .map(lambda x: {'HSC_MPP': 'HSC', 'Mono-Mac': 'Monocyte-macrophage', 'Kupffer Cell': 'Kupffer cell'}.get(x, x)) .astype('category'))ad_myeloid = ad_myeloid[ad_myeloid.obs['percent.mito'] <0.15]ad_myeloid.layers['counts'] = ad_myeloid.X.copy() ad_myeloid.layers['raw_counts'] = ad_myeloid.X.copy() ad_myeloid.layers['spliced'] = ad_myeloid.X.copy() ad_myeloid.layers['unspliced'] = ad_myeloid.X.copy()ad_myeloid.raw = ad_myeloid```## Preprocessing {#sec-preprocess}We'll begin our analysis by running our cells through a standard preprocessing pipeline composed of QC, integration, normalization, HVG selection, dimension reduction, & clustering. ### QCWe filter out cells with a sequencing depth of less than 1,000, then remove out genes expressed in less than 5 cells. ```{python}sc.pp.filter_cells(ad_myeloid, min_counts=1000)sc.pp.filter_genes(ad_myeloid, min_cells=5)```### HVG selectionHere we select the top 4000 most highly-variable genes (HVGs), which we'll use as input to PCA, integration, etc. ```{python}sc.pp.highly_variable_genes( ad_myeloid, n_top_genes=4000, flavor='seurat_v3', layer='counts', subset=True)```### Integration with `scVI`Using the `scVI` library, we perform celltype-aware integration across batches by training a variational autoencoder (VAE) for 150 epochs. This provides us with a 20-dimensional integrated latent space embedding that is (hopefully) free of batch effects. ```{python}#| results: hidescvi.settings.seed =312scvi.settings.num_threads =16scvi.model.SCVI.setup_anndata( ad_myeloid, layer='counts', batch_key='fetal.ids', labels_key='celltype')int_model = scvi.model.SCVI( ad_myeloid, n_layers=2, n_hidden=72, n_latent=20, gene_likelihood='nb', dispersion='gene-label')int_model.train( early_stopping=True, accelerator='cpu', max_epochs=150)ad_myeloid.obsm['X_scVI'] = int_model.get_latent_representation()``````{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: scVI embedding colored by celltype. #| label: fig-scvi-embedsc.pl.embedding( ad_myeloid, basis='scVI', color='celltype', title='', frameon=True, size=30, alpha=0.5, show=False)plt.gca().set_xlabel('scVI 1')plt.gca().set_ylabel('scVI 2')plt.show()```### NormalizationWe next depth- and log1p-normalize the mRNA counts matrix. ```{python}ad_myeloid.X = sc.pp.normalize_total(ad_myeloid, target_sum=1e4, inplace=False)['X']sc.pp.log1p(ad_myeloid)```### PCA embeddingUsing PCA we generate a 50-dimensional linear reduction of the normalized counts. ```{python}sc.pp.scale(ad_myeloid)sc.tl.pca( ad_myeloid, n_comps=50, random_state=312, use_highly_variable=True)```Plotting the first two PCs shows clear separation by celltype, though variation in the immature celltypes seems less clear. ```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: PCA embedding colored by celltype. #| label: fig-pca-embedsc.pl.embedding( ad_myeloid, basis='pca', color='celltype', title='', frameon=True, size=30, alpha=0.5, show=False)plt.gca().set_xlabel('PC 1')plt.gca().set_ylabel('PC 2')plt.show()```### SNN graph estimationNext, we identify the 50 nearest-neighbors (NNs) for each cell using the cosine distance in the `scVI` latent space. ```{python}sc.pp.neighbors( ad_myeloid, n_neighbors=50, n_pcs=None, metric='cosine', random_state=312, use_rep='X_scVI')```### ClusteringUsing the Leiden algorithm we partition the SNN graph into clusters. ```{python}sc.tl.leiden( ad_myeloid, resolution=0.3, random_state=312)```The identified clusters roughly match our celltypes. ```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: PCA embedding colored by Leiden cluster. #| label: fig-pca-leidensc.pl.embedding( ad_myeloid, basis='pca', color='leiden', title='', frameon=True, size=30, alpha=0.5, show=False)plt.gca().set_xlabel('PC 1')plt.gca().set_ylabel('PC 2')plt.show()```### Moment-based imputationMoving on, we create smoothed versions of the counts across each cell's 30 NNs. ```{python}scv.pp.moments( ad_myeloid, n_pcs=None, use_rep='X_scVI', n_neighbors=30)```### Undirected PAGA embeddingWe use the PAGA algorithm to generate a graph abstraction of the relationships between celltypes. ```{python}sc.tl.paga(ad_myeloid, groups='celltype')```Visualizing the results shows a pretty linear structure that matches what we expect biologically - stem cells at the root, monocytes & macrophages in the middle, and Kupffer cells at the end. ```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: PAGA embedding colored by celltype. Lines are weighted by (undirected) connectivity strength. #| label: fig-paga-embedsc.pl.paga( ad_myeloid, fontoutline=True, fontsize=12, frameon=True, random_state=312, show=False)plt.gca().set_xlabel('PAGA 1')plt.gca().set_ylabel('PAGA 2')plt.show()```### UMAP embeddingWe then create a nonlinear dimension reduction of the data using the UMAP algorithm. ```{python}sc.tl.umap(ad_myeloid, random_state=312)```In UMAP space the transitions between celltypes are clear and well-represented. ```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: UMAP embedding colored by celltype. #| label: fig-umap-embedsc.pl.embedding( ad_myeloid, basis='umap', color='celltype', title='', frameon=True, size=30, alpha=0.5, show=False)plt.gca().set_xlabel('UMAP 1')plt.gca().set_ylabel('UMAP 2')plt.show()```### Force-directed graph embeddingUsing the ForceAtlas2 algorithm we generate a 2-dimensional graph layout of our cells. ```{python}#| message: false#| warning: falsesc.tl.draw_graph( ad_myeloid, layout='fa', random_state=312)```The force-directed graph embedding is more visually appealing and cleaner than the UMAP embedding, so we'll most likely use it instead going forwards. ```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: Force-directed graph embedding colored by celltype. #| label: fig-graph-embedsc.pl.draw_graph( ad_myeloid, color='celltype', title='', size=30, alpha=0.5, show=False, frameon=True)plt.gca().set_xlabel('FA 1')plt.gca().set_ylabel('FA 2')plt.show()```### Diffusion map embeddingOur last embedding will be generated using diffusion maps, which explicitly attempts to preserve transitional structures in the data. ```{python}sc.tl.diffmap( ad_myeloid, random_state=312, n_comps=16)ad_myeloid.obsm['X_diffmap_old'] = ad_myeloid.obsm['X_diffmap']ad_myeloid.obsm['X_diffmap'] = ad_myeloid.obsm['X_diffmap'][:, 1:] ```While the diffusion map embedding captures some characteristics of the data, it masks variation in the mature macrophage celltypes.```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: Diffusion map embedding colored by celltype. #| label: fig-diffmap-embedsc.pl.embedding( ad_myeloid, basis='diffmap', color='celltype', title='', frameon=True, size=30, alpha=0.5, components='1,2', show=False)plt.gca().set_xlabel('DC 1')plt.gca().set_ylabel('DC 2')plt.show()```### Diffusion pseudotime estimationWe next compute a diffusion pseudotime estimate for each cell using the first diffusion component. ```{python}ad_myeloid.uns['iroot'] = np.argmin(ad_myeloid.obsm['X_diffmap'][:, 0])sc.tl.dpt(ad_myeloid, n_dcs=1)```Overall the diffusion pseudotime captures myeloid differentiation progression well, but it assigns very similar values to all the mature Kupffer cells which isn't ideal. ```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: Diffusion map embedding colored by diffusion pseudotime.#| label: fig-diffmap-pseudotimesc.pl.embedding( ad_myeloid, basis='diffmap', color='dpt_pseudotime', title='', frameon=True, size=30, alpha=0.5, components='1,2', show=False)plt.gca().set_xlabel('DC 1')plt.gca().set_ylabel('DC 2')plt.show()```## Save preprocessed dataWe save the processed `AnnData` object to disk. ```{python}ad_myeloid.write_h5ad('../../data/hematopoesis/ad_myeloid.h5ad')```We also save the embeddings, counts matrix, metadata, etc. so that we can create a `Seurat` object. ```{python}ad_myeloid.obs.reset_index(inplace=False).rename(columns={'index': 'cell'}).to_csv('../../data/hematopoesis/conversion_files/cell_metadata.csv')ad_myeloid.var.reset_index(inplace=False).rename(columns={'index': 'gene'}).to_csv('../../data/hematopoesis/conversion_files/gene_metadata.csv')pd.DataFrame(ad_myeloid.obsm['X_draw_graph_fa']).reset_index(inplace=False).to_csv('../../data/hematopoesis/conversion_files/graph_embed.csv')pd.DataFrame(ad_myeloid.obsm['X_umap']).reset_index(inplace=False).to_csv('../../data/hematopoesis/conversion_files/umap_embed.csv')pd.DataFrame(ad_myeloid.obsm['X_pca']).reset_index(inplace=False).to_csv('../../data/hematopoesis/conversion_files/pca_embed.csv')pd.DataFrame(ad_myeloid.obsm['X_scVI']).reset_index(inplace=False).to_csv('../../data/hematopoesis/conversion_files/scvi_embed.csv')pd.DataFrame(ad_myeloid.obsm['X_diffmap']).reset_index(inplace=False).to_csv('../../data/hematopoesis/conversion_files/diffmap_embed.csv')mmwrite('../../data/hematopoesis/conversion_files/raw_counts.mtx', a=ad_myeloid.layers['raw_counts'])```# Analysis {#sec-analysis}## Read data into RWe start our analysis by importing the preprocessed data from Python into R. Expand the code block for details. ```{r}#| message: false#| warning: false#| code-fold: truecell_metadata <- readr::read_csv("../../data/hematopoesis/conversion_files/cell_metadata.csv", show_col_types =FALSE, col_select =-1) %>%as.data.frame() %>% magrittr::set_rownames(.$cell)gene_metadata <- readr::read_csv("../../data/hematopoesis/conversion_files/gene_metadata.csv", show_col_types =FALSE, col_select =-1)embed_PCA <- readr::read_csv("../../data/hematopoesis/conversion_files/pca_embed.csv",show_col_types =FALSE, col_select =-c(1:2)) %>%as.data.frame() %>% magrittr::set_colnames(paste0("PC", 1:ncol(.))) %>% magrittr::set_rownames(cell_metadata$cell)embed_PCA <-CreateDimReducObject(embeddings =as.matrix(embed_PCA),key ="PC_", global =TRUE, assay ="RNA")embed_FA <- readr::read_csv("../../data/hematopoesis/conversion_files/graph_embed.csv", show_col_types =FALSE, col_select =-c(1:2)) %>%as.data.frame() %>% magrittr::set_colnames(paste0("FA", 1:ncol(.))) %>% magrittr::set_rownames(cell_metadata$cell)embed_FA <-CreateDimReducObject(embeddings =as.matrix(embed_FA),key ="FA_", global =TRUE, assay ="RNA")embed_UMAP <- readr::read_csv("../../data/hematopoesis/conversion_files/umap_embed.csv", show_col_types =FALSE, col_select =-c(1:2)) %>%as.data.frame() %>% magrittr::set_colnames(paste0("UMAP", 1:ncol(.))) %>% magrittr::set_rownames(cell_metadata$cell)embed_UMAP <-CreateDimReducObject(embeddings =as.matrix(embed_UMAP),key ="UMAP_", global =TRUE, assay ="RNA")embed_diffmap <- readr::read_csv("../../data/hematopoesis/conversion_files/diffmap_embed.csv", show_col_types =FALSE, col_select =-c(1:2)) %>%as.data.frame() %>% magrittr::set_colnames(paste0("DC", 1:ncol(.))) %>% magrittr::set_rownames(cell_metadata$cell)embed_diffmap <-CreateDimReducObject(embeddings =as.matrix(embed_diffmap),key ="DC_", global =TRUE, assay ="RNA")embed_scVI <- readr::read_csv("../../data/hematopoesis/conversion_files/scvi_embed.csv", show_col_types =FALSE, col_select =-c(1:2)) %>%as.data.frame() %>% magrittr::set_colnames(paste0("scVI", 1:ncol(.))) %>% magrittr::set_rownames(cell_metadata$cell)embed_scVI <-CreateDimReducObject(embeddings =as.matrix(embed_scVI),key ="scVI_", global =TRUE, assay ="RNA")raw_counts <- Matrix::t(Matrix::readMM("../../data/hematopoesis/conversion_files/raw_counts.mtx"))colnames(raw_counts) <- cell_metadata$cellrownames(raw_counts) <- gene_metadata$generaw_counts <-CreateAssayObject(counts = raw_counts,min.cells =0,min.features =0, key ="rna")```We can now create a `Seurat` object, to which we add our various embeddings and metadata. After adding all the various bits and pieces, we normalize the raw counts and re-identify HVGs. ```{r}#| message: false#| warning: falseseu_myeloid <-CreateSeuratObject(raw_counts, assay ="RNA",meta.data = cell_metadata, project ="hematopoesis", min.cells =0, min.features =0)seu_myeloid@meta.data <-mutate(seu_myeloid@meta.data, celltype =factor(celltype, levels =c("HSC", "Kupffer cell", "Monocyte-macrophage", "Monocyte", "Monocyte precursor", "Neutrophil-myeloid progenitor")))seu_myeloid@reductions$pca <- embed_PCAseu_myeloid@reductions$diffmap <- embed_diffmapseu_myeloid@reductions$fa <- embed_FAseu_myeloid@reductions$umap <- embed_UMAPseu_myeloid@reductions$scvi <- embed_scVIseu_myeloid <-NormalizeData(seu_myeloid, verbose =FALSE) %>%FindVariableFeatures(selection.method ="vst", verbose =FALSE)```## Pseudotime estimationUsing `Slingshot` we identify a single lineage of pseudotime across the force-directed graph embedding. ```{r}#| message: false#| warning: falsesling_res <-slingshot(seu_myeloid@reductions$fa@cell.embeddings[, 1:2], clusterLabels =as.factor(seu_myeloid$celltype), start.clus =c("HSC"), approx_points =1000)sling_curves <-slingCurves(sling_res, as.df =TRUE)sling_mst <-slingMST(sling_res, as.df =TRUE)sling_pt <-slingPseudotime(sling_res) %>%as.data.frame() %>%mutate(cell =rownames(.), .before =1) %>%rename(PT = Lineage1) %>%mutate(PT = (PT -min(PT)) / (max(PT) -min(PT)))```Visualizing the results from `Slingshot` shows us that our biological expectations have been met. ```{r}#| code-fold: true#| fig-cap: Force-directed graph embedding overlaid with the MST and principal curves estimated by Slingshot. #| label: fig-fa-slingp0 <-data.frame(Embeddings(seu_myeloid, "fa")) %>%mutate(celltype = seu_myeloid$celltype) %>%ggplot(aes(x = FA_1, y = FA_2, color = celltype)) +geom_point(size =1.5, alpha =0.5, stroke =0) +geom_path(data = sling_mst, mapping =aes(x = FA_1, y = FA_2, group = Lineage), linewidth =1.25, color ="black") +geom_point(data = sling_mst, mapping =aes(x = FA_1, y = FA_2, fill = Cluster), color ="black", shape =21, size =4.5, stroke =1.25, show.legend =FALSE) +scale_color_manual(values = palette_celltype) +scale_fill_manual(values = palette_celltype) +labs(x ="FA 1", y ="FA 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guide_umap()p1 <-data.frame(Embeddings(seu_myeloid, "fa")) %>%mutate(celltype = seu_myeloid$celltype) %>%ggplot(aes(x = FA_1, y = FA_2, color = celltype)) +geom_point(size =1.5, alpha =0.5, stroke =0) +geom_path(data = sling_curves,mapping =aes(x = FA_1, y = FA_2, group = Lineage), color ="black", linewidth =1.5, alpha =0.5, lineend ="round") +scale_color_manual(values = palette_celltype) +labs(x ="FA 1", y ="FA 2") +theme_scLANE(umap =TRUE) +theme(legend.title =element_blank()) +guide_umap()p2 <- (p0 / p1) +plot_layout(guides ="collect", axes ="collect")p2```## Trajectory DE testing with `scLANE`We use the `scLANE` package to perform trajectory DE testing across pseudotime time for the top 4000 HVGs.```{r}cell_offset <-createCellOffset(seu_myeloid)pt_df <-data.frame(PT = sling_pt$PT)candidate_genes <-chooseCandidateGenes(seu_myeloid, group.by.subject =FALSE, n.desired.genes =4000L)scLANE_models <-testDynamic(seu_myeloid, pt = pt_df, genes = candidate_genes, size.factor.offset = cell_offset, n.cores =16L,verbose =FALSE)scLANE_de_res <-getResultsDE(scLANE_models)```## `tradeSeq` DE analysis```{r}ts_start <-Sys.time()bioc_par <- BiocParallel::MulticoreParam(workers =16L, RNGseed =312)RNA_counts <-as.matrix(seu_myeloid@assays$RNA$counts)[candidate_genes, ]k_eval <-evaluateK(RNA_counts, pseudotime = pt_df, cellWeights =matrix(rep(1, nrow(pt_df)), ncol =1), offset =log(1/ cell_offset), k =3:10, plot =FALSE, nGenes =500, verbose =FALSE, parallel =TRUE, BPPARAM = bioc_par)best_k <-c(3:10)[which.min(abs(colMeans(k_eval -rowMeans(k_eval))))] # choose k w/ lowest MAD from mean AIC ts_models <-fitGAM(RNA_counts, pseudotime = pt_df, cellWeights =matrix(rep(1, nrow(pt_df)), ncol =1), offset =log(1/ cell_offset), nknots = best_k, sce =FALSE, parallel =TRUE, verbose =FALSE, BPPARAM = bioc_par)BiocParallel::bpstop(bioc_par)ts_de_res <-associationTest(ts_models, global =TRUE) %>%arrange(desc(waldStat)) %>%mutate(gene =rownames(.), pvalue_adj =p.adjust(pvalue, method ="fdr"), gene_dynamic_overall =if_else(pvalue_adj <0.01, 1, 0)) %>%relocate(gene)ts_end <-Sys.time()ts_diff <- ts_end - ts_start```The runtime for `tradeSeq` with `r best_k` knots per gene is: ```{r}ts_diff```## Cell fate analysisLastly, we'll perform an analysis of cell fate specification using the `CellRank` Python package. ### CytoTRACE kernel We begin by computing a CytoTRACE kernel; CytoTRACE uses the number of genes expressed in a cell as a proxy measurement for differentiation potential, with the assumption being that cells expressing more genes are more likely to be progenitor celltypes & vice versa. ```{python}#| results: hidectk = CytoTRACEKernel(ad_myeloid).compute_cytotrace()```We plot the CytoTRACE scores below. Interestingly, in addition to the expected result of HSCs having high CytoTRACE scores we see a subcluster of Kupffer cells with high scores as well. ```{python}#| code-fold: true#| fig-cap: Force-directed graph embedding colored by CytoTRACE score and pseudotime. Higher scores indicate higher differentiation potential. #| label: fig-fa-cytotracesc.pl.embedding( ad_myeloid, basis='draw_graph_fa', color=['ct_score', 'ct_pseudotime'], color_map='magma', show=False, size=30, alpha=0.5, title=['CytoTRACE score', 'CytoTRACE Pseudotime'])plt.show()```Next, we estimate a cell-cell transition probability matrix. ```{python}#| results: hidectk.compute_transition_matrix(threshold_scheme='soft', show_progress_bar=False)```Plotting the projection of that matrix on our force-directed graph embedding shows messy / reversed directionality. This indicates that CytoTRACE might not work very well on this dataset, and at the very least should be augmented with another data source. ```{python}#| code-fold: true#| fig-cap: Force-directed graph embedding streamline plot showing differentiation directions as inferred from CytoTRACE scores.#| label: fig-project-CTctk.plot_projection( basis='draw_graph_fa', recompute=True, color='celltype', title='', legend_loc='right margin', size=30, alpha=0.5, linewidth=2, frameon=True, show=False)plt.gca().set_xlabel('FA 1')plt.gca().set_ylabel('FA 2')plt.show() ```### Pseudotime kernelNext, we use the pseudotime estimates from `Slingshot` to create a pseudotime-based kernel and estimate another cell-cell transition probability matrix. ```{python}#| results: hidead_myeloid.obs['sling_PT'] = r.sling_pt['PT'].tolist()pk = PseudotimeKernel(ad_myeloid, time_key='sling_PT')pk.compute_transition_matrix(threshold_scheme='soft', show_progress_bar=False)```Plotting the projection of the pseudotime kernel's transition probability matrix shows a much more reasonable portrayal of differentiation directionality.```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: Force-directed graph embedding overlaid with streamline arrows derived from Slingshot pseudotime. #| label: fig-project-PTpk.plot_projection( basis='draw_graph_fa', recompute=True, color='celltype', title='', legend_loc='right margin', size=30, alpha=0.75, linewidth=2, frameon=True, show=False)plt.gca().set_xlabel('FA 1')plt.gca().set_ylabel('FA 2')plt.show()```### Combined kernelWe combine the two kernel using unequal weighting, with the pseudotime kernel being given priority due to its better overall capture of differentiation. ```{python}ck =0.2* ctk +0.8* pk```We visualize the combined kernel's projection below. It seems to reliably portray the direction of differentiation in our dataset.```{python}#| code-fold: true#| message: false#| warning: false#| fig-cap: Force-directed graph embedding overlaid with streamline arrows derived from the combined kernel. #| label: fig-project-combinedck.plot_projection( basis='draw_graph_fa', recompute=True, color='celltype', title='', legend_loc='right margin', size=30, alpha=0.75, linewidth=2, frameon=True, show=False)plt.gca().set_xlabel('FA 1')plt.gca().set_ylabel('FA 2')plt.show()```We can also simulate random walks on the high-dimensional cell-cell graph starting from the HSC population. We see that nearly all of the random walks terminate in the Kupffer cell population. ```{python}#| message: false#| warning: false#| fig-cap: Force-directed graph embedding overlaid with random walks along the cell-cell graph derived from the combined kernel. Starting points are highlighted in black and ending points in yellow. #| label: fig-RW-combinedck.plot_random_walks( n_sims=200, start_ixs={'celltype': 'HSC'}, basis='draw_graph_fa', color='celltype', legend_loc='right margin', seed=312, frameon=True, title='', linewidth=0.5, linealpha=0.25, size=30, alpha=0.5, ixs_legend_loc='lower', show_progress_bar=False)plt.gca().set_xlabel('FA 1')plt.gca().set_ylabel('FA 2')plt.show()```# Save data {#sec-save}Finally, we save our `Seurat` object, the output from `Slingshot`, and the models from `scLANE` and `tradeSeq`. ```{r}saveRDS(seu_myeloid, file ="../../data/hematopoesis/seu_myeloid.Rds")saveRDS(sling_res, file ="../../data/hematopoesis/sling_res.Rds")saveRDS(scLANE_models, file ="../../data/hematopoesis/scLANE_models.Rds")saveRDS(ts_models, file ="../../data/hematopoesis/ts_models.Rds")```# Session info {#sec-SI}```{r}sessioninfo::session_info()```